In [3]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
from yellowbrick.features import Rank2D
from yellowbrick.features import RadViz
from yellowbrick.features import ParallelCoordinates
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
This worksheet is a step-by-step guide on how to detect domains that were generated using a "Domain Generation Algorithm" (DGA). We will walk you through the process of transforming raw domain strings into Machine Learning features and creating a decision tree classifier which you will use to determine whether a given domain is legit or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.
Overview: 2 main steps:
1. Feature Engineering: transform the raw domain strings into numeric Machine Learning features.
2. Machine Learning Classification: train a decision tree classifier and use it to predict whether a given domain is legit or DGA-generated.
DGA - Background
"Various families of malware use domain generation
algorithms (DGAs) to generate a large number of pseudo-random
domain names to connect to a command and control (C2) server.
In order to block DGA C2 traffic, security organizations must
first discover the algorithm by reverse engineering malware
samples, then generate a list of domains for a given seed. The
domains are then either preregistered, sink-holed or published
in a DNS blacklist. This process is not only tedious, but can
be readily circumvented by malware authors. An alternative
approach to stop malware from using DGAs is to intercept DNS
queries on a network and predict whether domains are DGA
generated. Much of the previous work in DGA detection is based
on finding groupings of like domains and using their statistical
properties to determine if they are DGA generated. However,
these techniques are run over large time windows and cannot be
used for real-time detection and prevention. In addition, many of
these techniques also use contextual information such as passive
DNS and aggregations of all NXDomains throughout a network.
Such requirements are not only costly to integrate, they may not
be possible due to real-world constraints of many systems (such
as endpoint detection). An alternative to these systems is a much
harder problem: detect DGA generation on a per domain basis
with no information except for the domain name. Previous work
to solve this harder problem exhibits poor performance and many
of these systems rely heavily on manual creation of features;
a time consuming process that can easily be circumvented by
malware authors..."
[Citation: Woodbridge et al., 2016: "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks"]
A better alternative for real-world deployment would be to use "featureless deep learning" - we have a separate notebook where you can see how this can be implemented!
However, let's learn the basics first!
In [2]:
## Load data
df = pd.read_csv('../../data/dga_data_small.csv')
df.drop(['host', 'subclass'], axis=1, inplace=True)
print(df.shape)
df.sample(n=5).head() # print a random sample of the DataFrame
Out[2]:
In [3]:
df[df.isDGA == 'legit'].head()
Out[3]:
In [4]:
# Google's 10000 most common english words will be needed to derive a feature called ngrams...
# therefore we already load them here.
top_en_words = pd.read_csv('../../data/google-10000-english.txt', header=None, names=['words'])
top_en_words.sample(n=5).head()
# Source: https://github.com/first20hours/google-10000-english
Out[4]:
Option 1 to derive Machine Learning features is to manually hand-craft useful contextual information of the domain string. An alternative approach (not covered in this notebook) is "Featureless Deep Learning", where an embedding layer takes care of deriving features - a huge step towards more "AI".
Previous academic research has focused on the following features that are based on contextual information:
List of features:
- length: length of the domain string
- digits: number of digits in the domain string
- entropy: Shannon entropy of the domain string (H_entropy function provided)
- vowel-cons: vowel to consonant ratio (vowel_consonant_ratio function provided)
- ngrams: n-gram overlap with common English words (ngram functions provided)

Tasks: split into A and B parts, see below...
Please run the following function cell and then continue reading the next markdown cell with more details on how to derive those features. Have fun!
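For reference, the entropy the H_entropy helper below computes is the standard Shannon entropy over the characters of the domain string, with $p_c$ the relative frequency of character $c$:

$$H(x) = -\sum_{c} p_c \log_2 p_c$$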
In [5]:
def H_entropy(x):
    # Calculate the Shannon entropy over the characters of the string x
    prob = [float(x.count(c)) / len(x) for c in dict.fromkeys(list(x))]
    H = -sum([p * np.log2(p) for p in prob])
    return H

def vowel_consonant_ratio(x):
    # Calculate the vowel to consonant ratio
    x = x.lower()
    vowels_pattern = re.compile('([aeiou])')
    consonants_pattern = re.compile('([b-df-hj-np-tv-z])')
    vowels = re.findall(vowels_pattern, x)
    consonants = re.findall(consonants_pattern, x)
    try:
        ratio = len(vowels) / len(consonants)
    except ZeroDivisionError:  # no consonants in the string
        ratio = 0
    return ratio
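If you want a quick sanity check before applying these helpers to the DataFrame, you can call them on any example string (the string below is just an illustration, not taken from the dataset):
In [ ]:
# Optional sanity check of the helper functions on an illustrative string
example = 'facebook'
print(H_entropy(example))              # Shannon entropy of the characters
print(vowel_consonant_ratio(example))  # ratio of vowels to consonants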
Please try to derive a new pandas 2D DataFrame with a new column for each feature. Focus on length, digits, entropy and vowel-cons here. Also make sure to encode the isDGA column as integers. pandas.Series.str, pandas.Series.replace and pandas.Series.apply can be very helpful to quickly derive those features. The functions you need to apply here are provided in the cell above.
The ngrams feature is a bit more complicated; see the next instruction cell to add this feature...
In [6]:
# derive features
df['length'] = df.domain.str.len()
df['digits'] = df.domain.str.count('[0-9]')
df['entropy'] = df.domain.apply(H_entropy)
df['vowel-cons'] = df.domain.apply(vowel_consonant_ratio)
# encode strings of target variable as integers
df.isDGA = df.isDGA.replace(to_replace = 'dga', value=1)
df.isDGA = df.isDGA.replace(to_replace = 'legit', value=0)
print(df.isDGA.value_counts())
# check intermediate 2D pandas DataFrame
df.sample(n=5).head()
Out[6]:
Finally, let's tackle the ngrams feature. There are multiple steps involved to derive this feature. Here in this notebook, we use an implementation outlined in the academic paper Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence" - see section: Linguistic Features.
Steps involved:
1. Load the 10000 most common English words (already available as top_en_words in this notebook) and run the ngrams function on a list of all these words. The output is a list that contains ALL 1-grams, bi-grams and tri-grams of these 10000 most common English words.
2. Use the Counter function from collections to derive a dictionary d that contains the counts of all unique 1-grams, bi-grams and tri-grams.
3. The ngram_feature function does the core magic. It takes your domain as input, splits it into ngrams (n is a function parameter) and then looks up these ngrams in the English dictionary d we derived in step 2. The function returns the normalized sum of all ngrams that were contained in the English dictionary (see the short illustration after these steps). For example, running ngram_feature('facebook', d, 2) will return 171.28 (this value is just like the one published in the Schiavoni paper).
4. average_ngram_feature wraps around ngram_feature. Your task is to derive a feature that gives the average of ngram_feature for n=1, 2 and 3. The input to this function should be a simple list of 3 ngram_feature results, one each for n=1, 2 and 3.
5. Finally, apply average_ngram_feature to your domain column in the DataFrame, thereby adding ngrams to the df, and then drop the domain column from your DataFrame.
Please run the following function cell and then write your code in the following cell.
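To make step 3 concrete, here is a tiny illustration (plain Python, independent of the helper functions defined below) of the bigram split and the normalizing denominator for the example string 'facebook':
In [ ]:
# Illustration only: bigram split of 'facebook' and the normalizing denominator
example, n = 'facebook', 2
bigrams = [example[i:i+n] for i in range(0, len(example) - n + 1)]
print(bigrams)                 # ['fa', 'ac', 'ce', 'eb', 'bo', 'oo', 'ok']
print(len(example) - n + 1)    # 7, the value the ngram count sum is divided by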
In [7]:
# ngrams: Implementation according to Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence"
# http://s2lab.isg.rhul.ac.uk/papers/files/dimva2014.pdf

def ngrams(word, n):
    # Extract all ngrams and return a regular Python list
    # Input word: can be a simple string or a list of strings
    # Input n: can be one integer or a list of integers
    #          if you want to extract multiple ngrams and have them all in one list
    l_ngrams = []
    if isinstance(word, list):
        for w in word:
            if isinstance(n, list):
                for curr_n in n:
                    l_ngrams.extend([w[i:i+curr_n] for i in range(0, len(w) - curr_n + 1)])
            else:
                l_ngrams.extend([w[i:i+n] for i in range(0, len(w) - n + 1)])
    else:
        if isinstance(n, list):
            for curr_n in n:
                l_ngrams.extend([word[i:i+curr_n] for i in range(0, len(word) - curr_n + 1)])
        else:
            l_ngrams.extend([word[i:i+n] for i in range(0, len(word) - n + 1)])
    # print(l_ngrams)
    return l_ngrams

def ngram_feature(domain, d, n):
    # Input: your domain string (or a list of domain strings),
    #        a dictionary object d that contains the counts for the most common English words,
    #        and n, either an int or a list of ints, defining the ngram length
    # Core magic: looks up the domain ngrams in the English dictionary ngrams and sums up the
    # respective English dictionary counts for the respective domain ngram; the sum is normalized
    l_ngrams = ngrams(domain, n)
    # print(l_ngrams)
    count_sum = 0
    for ngram in l_ngrams:
        if d[ngram]:
            count_sum += d[ngram]
    try:
        feature = count_sum / (len(domain) - n + 1)
    except:  # e.g. zero division, or n passed as a list
        feature = 0
    return feature

def average_ngram_feature(l_ngram_feature):
    # Input is a list of results from calls to ngram_feature(domain, d, n),
    # usually with various n values, like 1, 2, 3...
    return sum(l_ngram_feature) / len(l_ngram_feature)

l_en_ngrams = ngrams(list(top_en_words['words']), [1, 2, 3])
d = Counter(l_en_ngrams)

from six.moves import cPickle as pickle
with open('../../data/d_common_en_words' + '.pickle', 'wb') as f:
    pickle.dump(d, f, pickle.HIGHEST_PROTOCOL)
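Before applying the feature to the whole DataFrame, you can optionally check the dictionary d against the example quoted in the instructions above (ngram_feature('facebook', d, 2) should come out at roughly 171.28; the exact value depends on the word list you loaded):
In [ ]:
# Optional check of the English ngram dictionary against the example from the instructions
print(ngram_feature('facebook', d, 2))   # roughly 171.28 according to the worksheet text
print(average_ngram_feature([ngram_feature('facebook', d, n) for n in [1, 2, 3]]))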
In [8]:
df['ngrams'] = df.domain.apply(lambda x: average_ngram_feature([ngram_feature(x, d, 1),
                                                                ngram_feature(x, d, 2),
                                                                ngram_feature(x, d, 3)]))
# check final 2D pandas DataFrame containing all final features and the target vector isDGA
df.sample(n=5).head()
Out[8]:
In [9]:
df_final = df
df_final = df_final.drop(['domain'], axis=1)
df_final.to_csv('../../data/dga_features_final_df.csv', index=False)
df_final.head()
Out[9]:
In [4]:
df_final = pd.read_csv('../../data/dga_features_final_df.csv')
print(df_final.isDGA.value_counts())
df_final.head()
Out[4]:
At this point, we've created a dataset with several features that can be used for classification. Using Yellowbrick, your final step is to visualize the features to see which will be of value and which will not.
First, let's create a Rank2D visualizer to compute the correlations between all the features. Detailed documentation available here: http://www.scikit-yb.org/en/latest/examples/methods.html#feature-analysis
In [8]:
feature_names = ['length','digits','entropy','vowel-cons','ngrams']
features = df_final[feature_names]
target = df_final.isDGA
In [10]:
visualizer = Rank2D(algorithm='pearson', features=feature_names)
visualizer.fit_transform(features)
visualizer.poof()
Now let's use a Seaborn pairplot as well. This will really show you which features have clear dividing lines between the classes. Docs are available here: http://seaborn.pydata.org/generated/seaborn.pairplot.html
In [15]:
sns.pairplot(df_final, hue='isDGA', vars=feature_names)
Out[15]:
Finally, let's try making a RadViz of the features. This visualization will help us see whether there is too much noise to make accurate classifications.
In [21]:
X = df_final[feature_names].values   # .as_matrix() was removed in newer pandas versions
y = df_final.isDGA.values
radvizualizer = RadViz(classes=['Benign', 'isDga'], features=feature_names)
radvizualizer.fit_transform(X, y)
radvizualizer.poof()
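That completes the feature engineering and visualization part. As a pointer toward the classification step described at the top of the worksheet, here is a minimal sketch of how a decision tree could be trained and evaluated with the sklearn modules already imported; the variable names, the 80/20 split and the random_state are our own choices, not prescribed by the worksheet.
In [ ]:
# Minimal sketch (not the worksheet's prescribed solution): split the data,
# fit a decision tree, and report simple evaluation metrics.
# Assumes 'features' and 'target' from the cells above.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    features, target, test_size=0.2, random_state=42)
clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))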
In [ ]: